1  Introduction

This  paper  describes  and  analyzes  experiments  we  performed  for  the  Medical  Records  track  in  the  2012  Text 
REtrieval  Conference  (TREC).  We  mainly  investigated  three  research  problems: 

1.  Evidence Aggregation: In last year’s track there were two general methods for obtaining a
visit ranking out of reports (smaller document units): (A) using reports as indexing and retrieval
units and then converting a report ranking into a visit ranking, and (B) using visits as indexing and
retrieval units by concatenating reports at the very first stage and then obtaining a visit ranking directly.

Method A avoids the potential problem of varying visit document length, while Method B naturally
aggregates evidence scattered over multiple reports and easily obtains a visit ranking. It is unclear which
method is better based on the reported results. Thus, we compared the two approaches, tried various
score aggregation methods for (A), and combined both approaches in a way that further improved
system performance.

2.  Expansion  Sources:  We  tested  a  variety  of  external  collections  (ranging  from  general  web  datasets 
to  domain-specific  thesauri,  and  from  Megabyte  datasets  to  Terabyte  datasets)  for  query  expansion, 
compared  their  effectiveness,  and  obtained  useful  insights  into  the  data. 

3.  Retrieval Models: We tested several statistical IR models (proven to be effective on news and web
collections) on this medical collection, and explored ways to combine these models to address different
aspects of the task. For instance, we used the MRF model to model term proximity since most medical concepts
are phrases. We also used a mixture of relevance models to obtain various relevant expansion terms
covered by different expansion collections, which is expected to significantly alleviate the
vocabulary mismatch between medical terminologies.

For  TREC  submissions,  we  tested  systems  that  combined  multiple  IR  models,  leveraged  diverse  expansion 
sources,  and  used  various  evidence  aggregation  methods.  We  implemented  all  the  retrieval  models  in  the 
Indri¹ retrieval system.

¹ http://www.lemurproject.org/indri/
2  System Features

In  this  section,  we  describe  the  retrieval  models  and  evidence  aggregation  strategies. 

2.1  Retrieval  Model 

In  this  section,  we  briefly  describe  the  three  retrieval  models,  namely  the  query  likelihood  language  model, 
MRF  model,  and  mixture  of  relevance  models,  which  serve  as  the  underpinning  of  our  retrieval  systems. 

Baseline  Model 

Our  baseline  model  is  the  query  likelihood  language  model  as  formulated  below: 

$$\mathrm{score}(D, Q) = \log P(Q \mid D) = \sum_{i=1}^{n} \log \frac{tf_{q_i,D} + \mu \, \frac{tf_{q_i,C}}{|C|}}{|D| + \mu}, \qquad (1)$$

where $q_i$ is the $i$th term in query $Q$, $n$ is the total number of terms in $Q$, $|D|$ and $|C|$ are the document and
collection lengths in words respectively, $tf_{q_i,D}$ and $tf_{q_i,C}$ are the document and collection term frequencies
of $q_i$ respectively, and $\mu$ is the Dirichlet smoothing parameter.
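As a minimal sketch of the Dirichlet-smoothed query likelihood score in Equation 1, the function below works on pre-tokenized text. The function name, the argument layout, and the toy statistics are illustrative assumptions; in the paper the scoring is done inside Indri, not in standalone code like this.

```python
import math
from collections import Counter


def query_likelihood(query_terms, doc_terms, coll_tf, coll_len, mu):
    """Log query likelihood of a document with Dirichlet smoothing.

    query_terms: list of query tokens (q_1 .. q_n)
    doc_terms:   list of document tokens, so |D| = len(doc_terms)
    coll_tf:     dict mapping a term to its collection frequency
    coll_len:    collection length in words, |C|
    mu:          Dirichlet smoothing parameter
    """
    doc_tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for q in query_terms:
        p_coll = coll_tf.get(q, 0) / coll_len           # tf_{q,C} / |C|
        p = (doc_tf[q] + mu * p_coll) / (doc_len + mu)  # smoothed P(q | D)
        if p == 0.0:
            # Term never occurs in the collection: the log is undefined.
            return float("-inf")
        score += math.log(p)
    return score


# Toy statistics, made up for illustration only.
doc = ["hearing", "loss", "hearing"]
stats = {"hearing": 10, "loss": 5}
print(query_likelihood(["hearing", "loss"], doc, stats, coll_len=100, mu=10.0))
```

Larger values of `mu` pull the estimate toward the collection model, which is why the smoothing parameter is worth tuning per index.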

Markov  Random  Field  Model 

Medical queries usually contain phrases that describe conditions, symptoms, drug names, treatments, etc.
These query terms are likely to occur in close proximity to each other in relevant documents. Thus, we use
the Markov random field (MRF) model proposed by Metzler and Croft [9] to model term dependencies. We
use their sequential dependence model in particular. Following Metzler and Croft [9], we set the feature
weights $(\lambda_T, \lambda_O, \lambda_U)$ to (0.8, 0.1, 0.1).
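The sequential dependence model is commonly expressed in the Indri query language as a weighted combination of unigram, exact-phrase, and unordered-window features. The sketch below builds such a query string with the weights (0.8, 0.1, 0.1) given above. The helper name `sdm_query` and the unordered window width of 8 are assumptions made for illustration; the paper does not state the window size it used.

```python
def sdm_query(terms, w_t=0.8, w_o=0.1, w_u=0.1, window=8):
    """Build an Indri query string for the sequential dependence model.

    terms:  list of query tokens, in query order
    w_t:    weight of single-term features
    w_o:    weight of ordered (exact adjacent phrase) features
    w_u:    weight of unordered-window features
    window: width of the unordered window (assumed value, not from the paper)
    """
    if len(terms) < 2:
        # No adjacent pairs exist, so only the unigram part applies.
        return "#combine(" + " ".join(terms) + ")"
    pairs = list(zip(terms, terms[1:]))  # adjacent query term pairs
    unigrams = " ".join(terms)
    ordered = " ".join(f"#1({a} {b})" for a, b in pairs)
    unordered = " ".join(f"#uw{window}({a} {b})" for a, b in pairs)
    return (
        f"#weight({w_t} #combine({unigrams}) "
        f"{w_o} #combine({ordered}) "
        f"{w_u} #combine({unordered}))"
    )


print(sdm_query(["hearing", "loss"]))
```

Only adjacent term pairs are used, which is what distinguishes the sequential variant from the full dependence model.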

Mixture  of  Relevance  Models 

Queries specified by search users can have a “vocabulary mismatch” with the content of a medical report
since there are many different ways to express a medical concept (e.g., “hearing loss”, “hearing impairment”,
“difficult of hearing”, and even “deafness” are all semantically related, but they have at most one term
in common). The consequence is that the system may have relatively low recall if there is a “vocabulary
mismatch”. We can alleviate this problem and improve our baseline retrieval model by expanding the query
with additional “related” terms. These related terms (also called expansion terms) can be derived from a
relevance model $\theta_Q$, which is usually built upon the top-ranked $k$ documents for the query in the target collection
(i.e., the same collection used for retrieval).

Thus, in this paper we derive expansion terms based on their weights $p_i$, which are estimated by:

$$p_i = \sum_{j=1}^{k} \exp\left\{ \frac{tf_{e_i,D_j}}{|D_j|} + \log \frac{|C|}{df_{e_i,C}} + \mathrm{score}(D_j, Q) \right\}, \qquad (2)$$

where $\mathrm{score}(D_j, Q)$ is the query likelihood score for the $j$th top-ranked feedback document in the initial retrieval set
ranked by Equation 1, $tf_{e_i,D_j}$ is the term frequency of candidate term $e_i$ in document $D_j$, $df_{e_i,C}$ is the document frequency
of $e_i$ in collection $C$, and $|D_j|$ and $|C|$ are document and collection lengths in words respectively. This
formula estimates the importance of a term based on its term frequency, inverse document frequency, and
feedback document scores. The $m$ terms with the highest weights $p_i$ are selected as expansion terms, and they form
our estimated relevance model $\theta_Q$. Note that we also normalize the weights $p_i$ so that we have an estimated probability
$P(w \mid \theta_Q)$ for each word $w$.

Relevance modeling can be further improved upon by leveraging information in other document collections.
Specifically, following Diaz and Metzler [2], we can form relevance models for two or more additional
collections, then expand the query using those models.

To  achieve  better  performance,  we  linearly  interpolate  the  mixture  of  relevance  models  with  the  maximum 
likelihood  (ML)  query  estimate  by  formulating  the  equation: 

$$P(w \mid \theta_Q) = \lambda_Q \, P_{\mathrm{ML}}(w \mid Q) + \sum_{c} \lambda_c \, P(w \mid \theta_{Q,c}), \qquad (3)$$

where the first part is the weighted ML query estimate for word $w$ and the second part represents the mixture
of relevance models. In particular, $P(w \mid \theta_{Q,c})$ is the probability of $w$ in the estimated relevance model $\theta$ built
upon top-ranked documents in expansion collection $c$. The $\lambda$’s are collection weights and $\lambda_Q + \sum_c \lambda_c = 1$. For
TREC submissions, we set the $\lambda$’s to (0.7, 0.1, 0.1, 0.1) and use the top 10 terms from the top 50 feedback documents.
We implement Equation 3 using Indri in the same way as our previous work [12]. We denote this model as
MRM.
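To make the interpolation in Equation 3 concrete, the following sketch mixes a maximum likelihood query model with one normalized relevance model per expansion collection. It reads Equation 3 as a weighted sum of the ML query estimate and the per-collection relevance models. The function names and the toy term weights are assumptions for illustration; the paper's actual implementation is an Indri query, not this code.

```python
from collections import Counter


def normalize(weights):
    """Scale non-negative term weights so that they sum to one."""
    total = sum(weights.values())
    return {term: value / total for term, value in weights.items()}


def mixture_of_relevance_models(query_terms, relevance_models, lambda_q, lambdas):
    """Interpolate the ML query model with several relevance models.

    query_terms:      list of query tokens
    relevance_models: list of dicts (term -> unnormalized expansion weight),
                      one per expansion collection
    lambda_q:         weight of the ML query estimate
    lambdas:          one weight per expansion collection
    """
    if abs(lambda_q + sum(lambdas) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    counts = Counter(query_terms)
    # ML query estimate: relative frequency of each term in the query.
    mixed = {t: lambda_q * c / len(query_terms) for t, c in counts.items()}
    for lam, model in zip(lambdas, relevance_models):
        for term, prob in normalize(model).items():
            mixed[term] = mixed.get(term, 0.0) + lam * prob
    return mixed


# Toy expansion weights, made up for illustration.
model = mixture_of_relevance_models(
    ["hearing", "loss"],
    [{"auditory": 3.0, "cochlear": 1.0}, {"ear": 2.0, "tympanic": 2.0}],
    lambda_q=0.7,
    lambdas=[0.2, 0.1],
)
print(sorted(model.items(), key=lambda kv: kv[1], reverse=True))
```

Because every component is a probability distribution and the weights sum to one, the mixed model is itself a distribution over terms.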

A  Combined  Model 

We linearly combine MRF and MRM to get our third retrieval model. The scoring function looks like

$$P(w \mid \theta_Q) = \lambda_Q \cdot \mathrm{MRF} + \sum_{c} \lambda_c \, P(w \mid \theta_{Q,c}), \qquad (4)$$

which is similar to Equation 3. The difference is that we replace the ML query estimate with MRF. The
new retrieval model is expected to benefit from term dependence modeling as well as query expansion.


Figure 1: Merging results from two different retrieval methods. (Diagram label: baseline/MRF/MRM models.)


2.2  Multi-level  Evidence 

In  the  aforementioned  retrieval  models,  “document”  (i.e.,  D )  is  a  broad  term  that  could  indicate  different 
granularities  of  a  visit:  the  text  of  all  reports  associated  with  the  visit,  a  single  report  from  the  visit,  or  just 
one  field  within  a  single  report  from  the  visit.  Thus,  in  this  section  we  describe  how  we  leverage  evidence 
from  each  of  these  to  come  up  with  a  final  document  score. 

Field  Level  Evidence 

The  main  fields  in  a  report  are  the  doctor’s  notes  and  the  fields  that  contain  diagnosis  codes.  Here  we  describe 
how  we  leverage  ICD-9  codes  in  the  language  model,  and  how  we  remove  some  extraneous  information  and 
extract  useful  information  from  doctor’s  notes. 

Code Expansion: We expand ICD codes in the “admit diagnosis” and “discharge diagnosis” fields with
their corresponding descriptions² to introduce additional useful words into the documents. We refer to this
feature as ICD in the following sections.

Negation Removal: The “report text” field contains clinical narratives. One distinct feature of clinical
narratives is that negation phrases are frequently used to claim the absence of certain conditions or symptoms
[1], such as “cannot tell”, “not clear”, “without evidence”, etc. Negations may cause false positives in retrieval.
Thus, we use NegEx³ [4], an open-source clinical negation detection tool, to remove all negated portions
of the sentences from the medical records before indexing. We refer to this feature as NEG in the following
sections.

² https://drchrono.com/public_billing_code_search

³ http://code.google.com/p/negex/

Age/Gender Filtering: We use simple regular expressions to search for age/gender indication words
and phrases in both the “report text” field and the topics. We use the extracted age and gender information
to filter from the retrieval set visits that do not meet the inclusion criteria specified in the topics. We refer
to this feature as AGF in the following sections.
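The regular-expression filtering described above might look roughly like the sketch below. The patterns, the gender vocabulary, and the policy of keeping visits whose age or gender cannot be extracted are all assumptions made for illustration; the paper does not list its actual expressions.

```python
import re

# Hypothetical patterns: "65-year-old", "65 years", "65 yrs", "65 yo", "65 y/o".
AGE_RE = re.compile(r"\b(\d{1,3})\s*-?\s*(?:years?|yrs?|y/?o)\b", re.IGNORECASE)
GENDER_RE = re.compile(r"\b(female|male|woman|man|girl|boy)\b", re.IGNORECASE)
GENDER_MAP = {"female": "F", "woman": "F", "girl": "F",
              "male": "M", "man": "M", "boy": "M"}


def extract_age(text):
    """Return the first age mentioned in the text, or None."""
    match = AGE_RE.search(text)
    return int(match.group(1)) if match else None


def extract_gender(text):
    """Return 'F' or 'M' for the first gender word found, or None."""
    match = GENDER_RE.search(text)
    return GENDER_MAP[match.group(1).lower()] if match else None


def passes_filter(visit_text, required_gender=None, age_range=None):
    """Keep a visit unless its extracted age or gender contradicts the topic.

    Visits with no extractable age/gender are kept (an assumed policy).
    """
    gender = extract_gender(visit_text)
    age = extract_age(visit_text)
    if required_gender and gender and gender != required_gender:
        return False
    if age_range and age is not None:
        low, high = age_range
        if not low <= age <= high:
            return False
    return True


print(passes_filter("A 65-year-old male with chest pain", required_gender="F"))
```

Keeping visits with missing demographics trades some precision for recall, since a report may simply omit the patient's age or gender.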

Previous work [7, 8, 3, 12] has shown that the above three medical features are quite useful. Thus, they
are the default features for our systems unless otherwise specified.

Report  Level  Evidence 

Evidence  in  a  visit  may  mainly  exist  in  only  a  small  proportion  of  all  the  associated  reports.  This  allows 
us  to  rely  on  the  strongest  evidence  of  a  visit  to  estimate  its  relevance.  Thus,  we  use  reports  as  the  initial 
retrieval  units  (i.e.,  building  an  index  for  reports  and  applying  the  retrieval  model  to  each  report),  and 
then  transform  a  report  ranking  into  a  visit  ranking  based  on  the  strongest  report-level  evidence,  which  is 
equivalent  to  using  the  following  report  score  merging  method  for  ranking  visits: 

$$\mathrm{score}_{RbM}(V, Q) = f_{RbM}(\{\mathrm{score}(r_1^V, Q), \mathrm{score}(r_2^V, Q), \ldots\}), \qquad (5)$$

where $r_j^V$ is a report associated with visit $V$ based on the report-to-visit mapping, $\mathrm{score}(r_j^V, Q)$ is the
language modeling score of the report with respect to query $Q$, and $f_{RbM}$ is the function for aggregating
the scores. We will try MAX, SUM, and ANZ for $f_{RbM}$ in Section 3. We name this evidence aggregation
strategy Retrieval-before-Merging (RbM). The merging process involved in RbM corresponds to “merging
I” in Figure 1.

Visit  Level  Evidence 

Evidence  may  also  spread  across  multiple  reports,  especially  when  the  information  need  is  a  complex  one. 
Thus,  our  second  strategy  to  aggregate  evidence  is  to  first  merge  reports  from  a  single  visit  field  by  field 
into  a  visit  document  and  then  construct  an  index  for  visit  documents.  With  this  strategy,  the  language 
model  built  on  a  merged  document  can  naturally  combine  the  evidence  scattered  across  multiple  reports. 
Furthermore,  this  strategy  can  directly  lead  to  a  ranking  of  visits  which  are  the  desired  retrieval  units. 
We call this second evidence aggregation strategy Merging-before-Retrieval (MbR). The merging process
involved  in  MbR  corresponds  to  “merging  II”  in  Figure  1. 

Top-level  Evidence 

RbM and MbR as described above are two different strategies for aggregating evidence and ranking visits.
MbR and RbM complement each other in that the former can naturally aggregate evidence spreading across
multiple reports (which would be challenging to do at the report level) while the latter can leverage the
strongest evidence (which may become less apparent after report merging in MbR) to estimate relevance.
This leads to our third evidence aggregation method, in which we take advantage of both RbM and MbR
by merging their visit rankings, as demonstrated by “merging III” in Figure 1. We call this third strategy
Visit-Ranking-Merging (VRM). The merging method (i.e., “merging III” in Figure 1) is defined by:

$$\mathrm{score}_{VRM}(V, Q) = f_{VRM}(\mathrm{score}_{RbM}(V, Q), \mathrm{score}_{MbR}(V, Q)), \qquad (6)$$

where $\mathrm{score}_{RbM}(V, Q)$ and $\mathrm{score}_{MbR}(V, Q)$ are the language modeling scores for visit $V$ with respect to query $Q$ in
the two visit rankings obtained by RbM and MbR respectively, $f_{VRM}$ is the function for score aggregation,
and $\mathrm{score}_{VRM}(V, Q)$ is the final score of visit $V$ in the merged ranking. We will try different methods for
$f_{VRM}$ such as CombMNZ, CombSUM, and CombMAX in Section 3 below.

2.3  Experimental  Setup 

We explore the best system feature settings for our 2012 TREC submissions by performing cross-validation on the
test collection of the 2011 Medical Records track. Once we find the best settings, as described in Section 3,
we train the systems on the whole 2011 test collection and generate runs for the 2012 topics.

We use the Indri retrieval system for indexing and retrieval. In particular, we use the Porter stemmer
to stem words in both reports and queries, and use a simple standard medical stoplist [5] for removing stopwords
in queries only. Then we conduct 5-fold cross-validation and use the top 1000 retrieved visits for each query to
evaluate our system under different settings. In each iteration, we train our system on 28 queries to obtain
the best parameter setting for MAP by sweeping the Dirichlet smoothing parameter (i.e., $\mu$ in Equation 1)
over the range [1000, 20000] at a step size of 1000, and then generate a ranking for each of the remaining
7 queries based on the trained system. When complete, we have full rankings for all 35 topics as a test set.
We evaluate the system based on MAP, bpref, P10, and Rprec over all 35 topics.
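The cross-validation procedure above can be sketched as follows: for each fold, the smoothing parameter that maximizes mean average precision on the 28 training topics is selected and then assigned to the 7 held-out topics. The function `evaluate_ap` is a hypothetical callback standing in for a full retrieval and evaluation run, and the interleaved fold assignment is an assumption; the paper does not say how topics were split.

```python
def cross_validate(topics, evaluate_ap, mus=range(1000, 20001, 1000), n_folds=5):
    """Pick a Dirichlet parameter per held-out topic via k-fold cross-validation.

    topics:      list of topic identifiers
    evaluate_ap: hypothetical callback (topic, mu) -> average precision
    mus:         candidate smoothing values, 1000 to 20000 in steps of 1000
    Returns a dict mapping each topic to the mu chosen without seeing it.
    """
    folds = [topics[i::n_folds] for i in range(n_folds)]  # interleaved split
    chosen = {}
    for held_out in folds:
        train = [t for t in topics if t not in held_out]

        def mean_ap(mu):
            return sum(evaluate_ap(t, mu) for t in train) / len(train)

        best_mu = max(mus, key=mean_ap)  # sweep the grid on training topics
        for topic in held_out:
            chosen[topic] = best_mu
    return chosen


# Toy stand-in for a retrieval run: AP peaks at mu = 5000 for every topic.
toy_ap = lambda topic, mu: 1.0 - abs(mu - 5000) / 20000
settings = cross_validate(list(range(35)), toy_ap)
print(len(settings), set(settings.values()))
```

With 35 topics and 5 folds, each iteration trains on 28 topics and tests on 7, matching the split described in the text.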

We train our systems on MAP though bpref is the primary evaluation metric for the 2011 Medical Records track.
There  are  two  reasons:  1)  training  on  MAP  is  most  commonly  used  in  IR  to  improve  retrieval  performance;  2) 
we  find  that  training  on  MAP  improves  the  retrieval  performance  on  other  metrics  as  well  while  training  on 
bpref  does  not  improve  the  overall  performance.  Thus,  MAP  and  bpref  will  both  be  the  primary  evaluation 
measures  in  this  work.  In  fact,  MAP  correlates  well  with  bpref  as  we  will  show  in  the  next  section. 

To assess the statistical significance of differences in the performance of two systems, we perform a
one-tailed paired t-test for MAP (since we train systems on MAP).
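For reference, the test statistic of a paired t-test on per-topic average precision can be computed as in the sketch below. It returns only the t statistic and the degrees of freedom; converting these to a one-tailed p-value requires the Student t distribution, which is left to a statistics package or table. The function name and the toy scores are assumptions for illustration.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(ap_a, ap_b):
    """Paired t statistic for per-topic scores of systems A and B.

    ap_a, ap_b: per-topic average precision lists, aligned by topic.
    Returns (t, degrees_of_freedom). A large positive t supports the
    one-tailed hypothesis that A outperforms B.
    """
    if len(ap_a) != len(ap_b):
        raise ValueError("both systems need scores for the same topics")
    diffs = [a - b for a, b in zip(ap_a, ap_b)]
    n = len(diffs)
    spread = stdev(diffs)  # sample standard deviation of the differences
    if spread == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (spread / math.sqrt(n))
    return t, n - 1


# Toy per-topic scores, made up for illustration.
t, df = paired_t_statistic([0.5, 0.6, 0.7, 0.9], [0.4, 0.5, 0.5, 0.6])
print(round(t, 3), df)
```

Pairing by topic removes the large between-topic variance, which is why it is preferred over an unpaired test when both systems run on the same topics.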


3  Experiments  and  Results 

This  section  describes  experiments,  presents  the  evaluation  results,  and  discusses  the  research  findings. 


3.1  Retrieval  before  Merging 

As mentioned in Section 2.2, we have several options for choosing the score merging function $f_{RbM}$ in
Equation 5 (i.e., “merging I” in Figure 1) for RbM. We describe them formally below:


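MAX:

$$\mathrm{score}_{RbM}(V, Q) = \max_j \, \mathrm{score}(r_j^V, Q)$$

SUM:

$$\mathrm{score}_{RbM}(V, Q) = \sum_j \mathrm{score}(r_j^V, Q)$$

ANZ:

$$\mathrm{score}_{RbM}(V, Q) = \frac{\sum_j \mathrm{score}(r_j^V, Q)}{\left|\{\mathrm{score}(r_j^V, Q) \neq 0\}\right|}$$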


where again $\mathrm{score}(r_j^V, Q)$ is the language modeling score of the report $r_j^V$ (associated with visit $V$) with
respect to query $Q$. ANZ stands for “Averaging over Non-Zeros”, meaning we only consider reports containing
at least one query term. MAX, SUM, and ANZ are similar to CombMAX, CombSUM, and CombANZ
proposed by Fox and Shaw [11]. However, CombMAX, CombSUM, and CombANZ were used for merging
multiple retrieval runs.
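The three report-to-visit aggregation functions can be sketched in a few lines. The function name, the dictionary-based inputs, and the toy scores below are assumptions for illustration; the example uses non-negative scores with zero meaning "no query term matched", which is what the ANZ definition presupposes.

```python
from collections import defaultdict


def merge_report_scores(report_scores, report_to_visit, method="MAX"):
    """Turn report-level scores into visit-level scores (RbM, merging I).

    report_scores:   dict report_id -> retrieval score for one query
    report_to_visit: dict report_id -> visit_id
    method:          "MAX", "SUM", or "ANZ" (average over non-zero scores)
    """
    by_visit = defaultdict(list)
    for report_id, score in report_scores.items():
        by_visit[report_to_visit[report_id]].append(score)

    visit_scores = {}
    for visit, scores in by_visit.items():
        if method == "MAX":
            visit_scores[visit] = max(scores)
        elif method == "SUM":
            visit_scores[visit] = sum(scores)
        elif method == "ANZ":
            nonzero = [s for s in scores if s != 0]
            visit_scores[visit] = sum(scores) / len(nonzero) if nonzero else 0.0
        else:
            raise ValueError(f"unknown method: {method}")
    return visit_scores


# Toy scores, made up for illustration.
scores = {"r1": 0.2, "r2": 0.5, "r3": 0.0, "r4": 0.3}
mapping = {"r1": "v1", "r2": "v1", "r3": "v1", "r4": "v2"}
for name in ("MAX", "SUM", "ANZ"):
    print(name, merge_report_scores(scores, mapping, name))
```

MAX keeps only the single best report per visit, so it is insensitive to how many reports a visit has, whereas SUM grows with the number of reports.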

Table 1 shows that MAX is superior to SUM and ANZ. This supports our assumption that we can rely
on the strongest evidence (i.e., the most relevant report) of a visit to estimate the relevance of that visit.
Thus, we will use MAX for score merging in RbM by default in this paper.


        MAX (selected)   SUM     ANZ
MAP     0.416            0.110   0.317

Table 1: Comparison of score merging methods for RbM.


3.2  Evidence  Aggregation 

Similarly, we also have several options for choosing the score merging function $f_{VRM}$ in Equation 6 (i.e.,
“merging III” in Figure 1) for VRM, such as CombMNZ, CombSUM, CombMAX, and CombANZ [11]. In our case, we
are only merging two rankings. Thus, these merging methods are specified as follows:


CombMNZ:

$$\mathrm{score}_{VRM}(V, Q) = N_V \cdot [\mathrm{score}_{RbM}(V, Q) + \mathrm{score}_{MbR}(V, Q)]$$

CombSUM:

$$\mathrm{score}_{VRM}(V, Q) = \mathrm{score}_{RbM}(V, Q) + \mathrm{score}_{MbR}(V, Q)$$

CombMAX:

$$\mathrm{score}_{VRM}(V, Q) = \max(\mathrm{score}_{RbM}(V, Q), \mathrm{score}_{MbR}(V, Q))$$

CombANZ:

$$\mathrm{score}_{VRM}(V, Q) = \frac{\mathrm{score}_{RbM}(V, Q) + \mathrm{score}_{MbR}(V, Q)}{N_V}$$

where $\mathrm{score}_{VRM}(V, Q)$ is the merged score for visit $V$, $\mathrm{score}_{RbM}(V, Q)$ and $\mathrm{score}_{MbR}(V, Q)$ are the scores
for $V$ in the two different visit rankings as demonstrated in Figure 1, and $N_V$ is the number of rankings that have
$V$ in the top 1200 retrieved visits (we cut off the merged ranked list at rank 1000 to get the final ranking). Note
that the RbM or MbR score of $V$ is taken to be 0 if $V$ does not appear in the top 1200 visits retrieved by that method. We compare the performance of
these merging methods using the primary evaluation measures in Table 2. As we can see, CombMNZ and
CombSUM achieve comparable performance, and are better than CombMAX and CombANZ. Thus, we can
infer that a good aggregation strategy for “merging III” should favor visits that appear in both rankings. We
use CombMNZ and CombSUM as the merging methods for VRM.
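The four fusion rules for merging the RbM and MbR visit rankings can be sketched as below, including the pool depth of 1200 and the final cutoff at rank 1000 mentioned in the text. The function names and toy scores are assumptions for illustration, and the example presumes non-negative scores so that a missing visit can sensibly contribute 0; the paper does not describe any score normalization.

```python
def top_k(scores, k):
    """Keep the k highest-scoring visits of one ranking."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])


def merge_rankings(rbm, mbr, method="CombMNZ", depth=1200, cutoff=1000):
    """Fuse two visit rankings (VRM, merging III).

    rbm, mbr: dict visit_id -> score from the RbM and MbR rankings
    method:   "CombMNZ", "CombSUM", "CombMAX", or "CombANZ"
    Returns a list of (visit_id, merged_score), best first.
    """
    rbm_top, mbr_top = top_k(rbm, depth), top_k(mbr, depth)
    merged = {}
    for visit in set(rbm_top) | set(mbr_top):
        a = rbm_top.get(visit, 0.0)  # 0 when the visit is missing from RbM
        b = mbr_top.get(visit, 0.0)  # 0 when the visit is missing from MbR
        n_v = (visit in rbm_top) + (visit in mbr_top)  # N_V is 1 or 2
        if method == "CombMNZ":
            merged[visit] = n_v * (a + b)
        elif method == "CombSUM":
            merged[visit] = a + b
        elif method == "CombMAX":
            merged[visit] = max(a, b)
        elif method == "CombANZ":
            merged[visit] = (a + b) / n_v
        else:
            raise ValueError(f"unknown method: {method}")
    ranked = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:cutoff]


# Toy scores, made up for illustration.
rbm = {"v1": 0.6, "v2": 0.4}
mbr = {"v1": 0.5, "v3": 0.7}
for name in ("CombMNZ", "CombSUM", "CombMAX", "CombANZ"):
    print(name, [visit for visit, _ in merge_rankings(rbm, mbr, name)])
```

In this toy case CombMNZ and CombSUM rank the visit found by both methods first, while CombMAX and CombANZ do not reward that agreement.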


Method               MAP     bpref
CombMNZ (selected)   0.446   0.564
CombSUM (selected)   0.446   0.563
CombMAX              0.427   0.559
CombANZ              0.356   0.510

Table 2: Comparison of score merging methods for VRM.

Next,  we  compare  the  three  evidence  aggregation  strategies  as  described  in  Section  2.2.  Table  3  shows 
that  VRM  is  significantly  better  than  MbR  and  RbM  on  MAP,  which  means  that  merging  visit  rankings  as 
the  top-level  evidence  aggregation  strategy  boosts  the  retrieval  performance  significantly. 


System   MAP      bpref   P10     Rprec
MbR      0.393    0.530   0.565   0.403
RbM      0.416    0.551   0.594   0.434
VRM      0.446▲   0.563   0.635   0.456

Table 3: Comparison of evidence aggregation methods. ▲ indicates that the MAP difference between VRM
and MbR/RbM is statistically significant (p < 0.05).


3.3  Selection  of  Expansion  Collections 

We test several expansion collections. In addition to the medical records that are the target of retrieval,
we leverage information in several other large, widely available collections: the ImageCLEF 2009 Medical Image
Retrieval Task dataset [10], the TREC 2007 Genomics Track dataset [6], the TREC 2009 ClueWeb09 Category B
dataset (excluding Wikipedia pages), a Wikipedia dataset (containing those excluded Wikipedia pages), and
the 2012 Medical Subject Headings (MeSH). Table 4 provides detailed information about these datasets.
In particular, we use MeSH for expansion in the way described in [13]. Moreover, the CLEF dataset
consists of 74,902 medical images. We crawled the 5,704 full-text CLEF articles associated with these images
and used them as the actual external collection in this work. We choose these collections because there are existing
topics and relevance judgments for analysis and because we want to compare the effects of different sources
on retrieval performance.

For simplicity, we use aggregation strategy MbR (without any of the medical features described above)
and retrieval model MRM with one expansion collection at a time to explore the expansion effectiveness of
each collection, as shown in Table 5.

As we can see in Table 5, ImageCLEF and Wikipedia have comparable improvement over the baseline,
though the former is more medical-related, much smaller, and less noisy than the latter. The same
situation applies to Genomics and ClueWeb09. However, Genomics and ClueWeb09 are much larger than
Collection   # documents   vocabulary size   avg doc length
Medical*     100,866       10^5              423
MeSH         n/a           n/a               n/a
ImageCLEF    5,704         10^5              6,495
Genomics     162,259       10^7              6,595
Wikipedia    5,957,529     10^6              1,305
ClueWeb09    44,262,894    10^7              756

Table 4: Collection Statistics.


System          MAP              Significance   bpref   P10
Baseline (B)    0.353            -              0.469   0.506
ImageCLEF (I)   0.371 (+5.1%)    -              0.492   0.544
Wikipedia (W)   0.376 (+6.5%)    -              0.500   0.550
ClueWeb09 (C)   0.390 (+11%)     > {B}          0.513   0.556
MeSH (S)        0.391 (+11%)     > {B, I}       0.496   0.547
Medical (M)     0.393 (+11%)     > {B}          0.520   0.535
Genomics (G)    0.395 (+12%)     > {B, W}       0.524   0.553

Table 5: Evaluation of query expansion. “X > S” means the MAP difference between system X and any
system specified in set S is statistically significant. The statistical significance is determined using a one-tailed
paired t-test on queries with p-value < 0.05.


          Features                                                  Scores
runID     MRF, MRM (Genomics+Medical+ClueWeb; MeSH)    VRM       MAP     infAP   infNDCG   Rprec   P10
udelSUM   ✓ ✓                                          CombSUM   0.413   0.286   0.578     0.419   0.592
udelMNZ   ✓ ✓                                          CombMNZ   0.412   0.285   0.576     0.418   0.594
udelMRF   ✓                                            CombMNZ   0.408   0.280   0.572     0.415   0.604
udelMED                                                CombMNZ   0.398   0.269   0.564     0.410   0.590

Table 6: Feature settings and results for TREC submissions.


ImageCLEF  and  Wikipedia  respectively,  and  Genomics  and  ClueWeb09  both  have  significant  improvement 
over  the  baseline.  Genomics  is  also  significantly  better  than  Wikipedia.  Thus,  we  can  infer  that  expansion 
effectiveness  depends  on  both  the  quality  (i.e.,  content  similarity  to  the  target  collection)  and  size  of  the 
expansion  collection. 

MeSH expansion is different from general expansion in that it relies on a controlled vocabulary, so
the expansion terms derived from it are not as diversified as those from a general expansion collection. For
instance, for the query “hearing loss”, it is difficult for MeSH to provide related expansion terms such as
“cochlear”, “noise”, “auditory”, and “binaural” (top-ranked terms from Genomics), “cerumen”, “canals”,
and “tympanic” (from Medical), “vestibular”, “ear”, and “stape” (from ImageCLEF). Some of these terms
do appear in the MeSH trees at upper levels; however, it is hard to find a link to them, i.e., to discriminate
them from other unrelated tree nodes. Simply including all visited concepts along the path is likely to
cause query drift. Moreover, these terms normally appear in phrase concepts having different meanings than
individual terms.

MeSH expansion is quite restrictive, yet is comparable to top performing single expansions and is significantly
better than the baseline and ImageCLEF. This is most likely because our MeSH expansion emphasizes
modeling term proximity, which is a big advantage of any medical thesaurus-based expansion over general
expansion. Another merit of MeSH expansion is that, if used properly, it rarely includes bad expansion
terms, while we have no control over the quality of each expansion term from general expansions.

Based on Table 5, we choose Genomics, Medical, MeSH, and ClueWeb09 (i.e., the top-performing expansion
collections) for our MRM model.

3.4  TREC  Submissions  and  Results 

Based on all the previous investigations, we select and combine multiple features for our TREC submissions as
shown  in  Table  6.  The  settings  for  udelMRF  and  udelMED  are  for  evaluating  the  impact  of  MRF  and  MeSH 
(a) udelSUM is below TREC medians for 3/47 topics on infNDCG.

(b) udelSUM is below TREC medians for 4/47 topics on P10.

Figure 2: Comparison with TREC results. Each panel plots the Best, Median, Worst, and udelSUM scores.


          udelMNZ   udelMRF   udelMED
udelSUM   0.1635    0.0190    0.0005
udelMNZ   -         0.0335    0.0008
udelMRF   -         -         0.0181

Table 7: Pairwise one-tail paired t-test on infAP


respectively.  Table  6  also  shows  the  evaluation  scores  averaged  over  47  official  topics.  We  pick  udelSUM, 
the  system  with  the  highest  MAP  score,  for  further  analysis.  Figure  2  shows  the  comparison  of  infNDCG 
and  P10  scores  with  TREC  results  (combining  both  automatic  and  manual  runs).  As  we  can  see,  system 
udelSUM  is  above  TREC  medians  for  the  majority  of  topics.  We  observe  similar  results  for  the  other  three 
runs. 

Table  7  shows  the  results  of  pairwise  one-tail  paired  t-test  on  infAP  for  our  four  submitted  runs.  The 
significance  scores  indicate  that  MRF  and  MeSH  are  both  very  effective  system  features. 

4  Conclusion  and  Future  Work 

For the 2012 Medical Records track, we investigated various evidence aggregation methods, explored different
query expansion collections, and combined multiple statistical IR models. Our systems perform well compared
with the aggregated TREC results. In particular, we found the following to be very effective: 1)
external expansion using diverse sources, 2) models that incorporate term proximity information, 3) evidence
aggregation at both report and visit levels. For future work, we plan to investigate why our systems
did not do well on a few hard topics.